
transport: drop outlier RTT samples above MaxReasonableRTTMs (30s) #2421

Merged
0pcom merged 1 commit into skycoin:develop from 0pcom:fix/transport-latency-rtt-upper-bound
May 3, 2026

Conversation


0pcom commented May 3, 2026

Summary

Production observation post-#2415/#2418: three sudph transports show min/avg ~190ms but max ~33-35 seconds in TPD's /metrics. The min/avg are correct; the max comes from a single straggler pong arriving long after its ping. handleTransportPong correlates a pong to its ping only by timestamp, with no in-flight tracking that would discard a pong arriving after the implied timeout. Once such a sample lands, Max only ever grows, so a single bad sample pins the transport's Max for the life of the lat:<id> key (35 days).

Real intercontinental RTT plus heavy queueing is comfortably under 5s, so 30s is a generous upper bound for "obviously bogus".
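
To make the pinning concrete, here is a minimal Go sketch of a running min/max/avg aggregation. It is illustrative only; the type and field names are not the actual skywire code. Once a single 35s straggler lands, max sticks no matter how many good samples follow.

```go
package main

import "fmt"

// latencyStats is a hypothetical running-stats holder, only to show why one
// bad sample pins Max: the aggregation only ever keeps the larger value.
type latencyStats struct {
	Min, Max, Avg float64 // milliseconds
	count         int
}

func (s *latencyStats) add(sampleMs float64) {
	s.count++
	if s.count == 1 || sampleMs < s.Min {
		s.Min = sampleMs
	}
	if sampleMs > s.Max {
		s.Max = sampleMs // a single 35_000 ms outlier sticks here forever
	}
	s.Avg += (sampleMs - s.Avg) / float64(s.count) // running mean
}

func main() {
	var s latencyStats
	for i := 0; i < 1000; i++ {
		s.add(190) // steady ~190 ms RTTs
	}
	s.add(35_000) // one straggler pong
	for i := 0; i < 1000; i++ {
		s.add(190) // more good samples change nothing about Max
	}
	fmt.Printf("min=%.0f avg=%.0f max=%.0f\n", s.Min, s.Avg, s.Max)
	// Output: min=190 avg=207 max=35000; min/avg stay plausible, max is pinned.
}
```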

Fix

Mirror the existing zero-clobber lower-bound guard with an upper bound at every layer (a rough Go sketch of the guard shape follows this list):

  • transport.MaxReasonableRTTMs — exported constant, 30_000.0 ms.
  • SetLatency drops latencyMs > MaxReasonableRTTMs (alongside the existing <= 0).
  • SetLatencyStats rejects the whole snapshot if any of min/max/avg exceeds the bound.
  • handleTransportPong short-circuits before SetLatency, Trace-logs the dropped value.
  • RouteGroup.MeasureLatency drops outlier samples from measurements so a stale pong from the periodic pingLoop landing in the buffered pongCh can't poison an active measurement's min/max/avg.
  • RouteGroup.handlePongPacket rejects up front so neither networkStats, the synchronous pendingPongCh consumer, nor the underlying transport see the bogus value.
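
For reference, a rough sketch of the guard shape. Only MaxReasonableRTTMs is from this PR; the package and helper names below are made up for illustration, not the actual skywire source.

```go
package guard

// MaxReasonableRTTMs mirrors the exported bound added in this PR: any RTT
// sample above it is treated as obviously bogus and dropped.
const MaxReasonableRTTMs = 30_000.0 // milliseconds

// validRTT extends the existing zero-clobber lower bound with the upper bound.
// Samples exactly at the bound are still accepted.
func validRTT(latencyMs float64) bool {
	return latencyMs > 0 && latencyMs <= MaxReasonableRTTMs
}

// validSnapshot rejects a whole min/max/avg snapshot if any field is out of
// bounds, so a partially bogus snapshot never reaches the stored stats.
func validSnapshot(minMs, maxMs, avgMs float64) bool {
	return validRTT(minMs) && validRTT(maxMs) && validRTT(avgMs)
}
```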

Test plan

  • go build ./... clean.
  • go vet ./pkg/transport/... ./pkg/router/... clean.
  • go test ./pkg/transport/ ./pkg/router/ ./pkg/transport-discovery/... all pass.
  • gofmt clean.
  • New regression tests (a sketch of the boundary cases follows this list):
    • TestSetLatencyDropsOutliers — rejects the 35s production case, accepts exactly at the bound, valid samples after rejection still update the running stats.
    • TestSetLatencyStatsDropsOutliers — table test rejects partial-outlier shapes (over-bound min/max/avg in any combination), accepts a fully-in-bound snapshot.
  • After deploy, in skywire cli tp metrics --by-transport: no transport's latency.max exceeds 30,000 ms; production cases like 51eced2f, f4c0e8c7, f95d6019 (currently min/avg ~190ms, max ~32-35s) settle to a plausible max within ~5 minutes of a CXO push.
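
The regression tests pin roughly these boundary cases. This is a sketch against an illustrative stand-in helper, not the real SetLatency/SetLatencyStats signatures.

```go
package guard

import "testing"

// accept stands in for the real SetLatency guard, whose signature is not
// reproduced here; the bound mirrors transport.MaxReasonableRTTMs.
func accept(rttMs float64) bool { return rttMs > 0 && rttMs <= 30_000.0 }

func TestDropsOutliers(t *testing.T) {
	cases := []struct {
		name  string
		rttMs float64
		want  bool
	}{
		{"35s production straggler", 35_000, false},
		{"exactly at the bound", 30_000, true},
		{"valid sample after a rejection", 190, true},
		{"existing zero-clobber case", 0, false},
	}
	for _, c := range cases {
		if got := accept(c.rttMs); got != c.want {
			t.Errorf("%s: accept(%v) = %v, want %v", c.name, c.rttMs, got, c.want)
		}
	}
}
```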

Followup to

#2415 (zero-clobber) and #2418 (persistent latency). The 35s outliers were the first thing visible once latency started persisting, since transient bad samples used to disappear with the registration TTL — they now stay around for 35 days.

Production observation: three sudph transports persisted with
min/avg ~190ms and max ~33-35 SECONDS in /metrics. The min/avg are
correct; the max is a single straggler pong arriving long after its
ping (delayed packet, host suspended, NIC queue stall). Once it
lands, Max only ever grows, so a single bad sample pins the
transport's Max for the life of the lat:<id> key (35 days).

handleTransportPong correlates pongs to pings only by timestamp —
there's no in-flight tracking that would discard a pong arriving N
seconds after timeout. Real intercontinental RTT plus heavy
queueing is comfortably under 5s, so 30s is a generous upper bound
for "obviously bogus".

Mirror the existing zero-clobber lower-bound guard with an
upper-bound at every layer:

  - transport.MaxReasonableRTTMs (exported, 30_000.0).
  - SetLatency drops latencyMs > MaxReasonableRTTMs (alongside <=0).
  - SetLatencyStats rejects the whole snapshot if any of min/max/avg
    exceeds the bound.
  - handleTransportPong short-circuits before SetLatency, with a
    Trace log of the dropped value.
  - RouteGroup.MeasureLatency drops samples >MaxReasonableRTTMs from
    `measurements` so a stale pong from the periodic pingLoop
    leaking into the buffered pongCh can't poison the active
    measurement's min/max/avg.
  - RouteGroup.handlePongPacket rejects up front so neither
    networkStats, the synchronous MeasureLatency consumer, nor the
    underlying transport see the bogus value.

TestSetLatencyDropsOutliers and TestSetLatencyStatsDropsOutliers
pin both the rejection and the boundary-exact acceptance.
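
The MeasureLatency change described in the commit amounts to filtering samples as they are collected. Roughly, under assumed shapes (the channel type and function below are not the actual RouteGroup code):

```go
package measure

import "time"

const maxReasonableRTT = 30 * time.Second // time.Duration form of the bound

// collect drains up to n pongs and keeps only plausible samples, so a stale
// pong leaking in from the periodic pingLoop cannot poison min/max/avg.
// Carrying the matching ping's send time on the channel is an assumption
// made only for this sketch.
func collect(pongCh <-chan time.Time, n int) []time.Duration {
	samples := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		sentAt, ok := <-pongCh
		if !ok {
			break
		}
		rtt := time.Since(sentAt)
		if rtt <= 0 || rtt > maxReasonableRTT {
			continue // stale or bogus pong: drop instead of recording
		}
		samples = append(samples, rtt)
	}
	return samples
}
```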
@0pcom 0pcom merged commit 0a7e12b into skycoin:develop May 3, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 4, 2026
…iminator (#2422)

Production TPD was restarting every 30-40s (RestartCount 556 over 6h
on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216
   panic: "Finc to negative for: <hash>"
   The filling-refcount went below zero — likely a duplicate Finc on
   a Filler.incs map, or an Inc/Finc mismatch across overlapping
   fillers. Hard process kill via panic. Filler.apply / Filler.reject
   already consume Finc's error return and just log; surface the
   inconsistency through that path instead. Clamp fc to 0, log the
   condition with key and the offending inc, and continue. Worst
   case is a leaked filling-item slot — orders of magnitude better
   than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50
   panic: "short write: i/o deadline reached"
   isIOError checks errors.Is(err, io.ErrShortWrite), but
   net/http's timeoutWriter returns its own error value with the
   same message string when a write deadline expires mid-response.
   errors.Is misses it (different sentinel), the fallback string
   match didn't include "short write", so getAllTransports'
   ~1MB JSON write to a slow client panics on every deadline hit.
   Added "short write", "i/o timeout", and "deadline exceeded"
   to the string-match fallback. New TestIsIOErrorShortWriteVariants
   pins all sentinel and string-match cases.

Neither bug is caused by #2415/#2418/#2421; they were just made
visible because deploys cycling at 30-40s aren't subtle. Together
these stop the panic loop without changing any data semantics.
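
A hedged sketch of the isIOError shape described in item 2: a sentinel check plus a string-match fallback. Only the three added substrings are from the commit; the package name and surrounding code are illustrative.

```go
package httputilsketch

import (
	"errors"
	"io"
	"strings"
)

// isIOError reports whether err looks like a client-side I/O failure that
// should be logged and swallowed rather than allowed to panic the handler.
func isIOError(err error) bool {
	if err == nil {
		return false
	}
	// The sentinel check catches the standard-library value...
	if errors.Is(err, io.ErrShortWrite) {
		return true
	}
	// ...but net/http's timeoutWriter returns its own error carrying the same
	// message text, so errors.Is misses it; fall back to substring matching.
	// (The real fallback list has other entries; only the newly added ones
	// named in the commit are shown here.)
	msg := err.Error()
	for _, s := range []string{"short write", "i/o timeout", "deadline exceeded"} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}
```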
0pcom added a commit that referenced this pull request May 4, 2026
Visors running pre-#2421 binaries push outlier max samples (e.g. 35s
RTT from a straggler pong) via CXO. TPD's UpdateLatency only had a
lower-bound guard (avg<=0 or any field <=0 → drop), so those reach
the lat:<id> store and pin Max for the 35-day retention regardless
of how many later good samples land.

Production right now: 5 transports show max>30s, all written ~10h
ago, ages-out date 35d from each write. Won't go away on its own.

Reject the same way the visor side does (transport.MaxReasonableRTTMs,
30s) so TPD's defense-in-depth matches the visor's. Old visors keep
pushing bogus values; TPD now silently drops them, and the next good
sample from any peer overwrites the stale stored max.

The package already imports pkg/transport (via redis_transport.go)
so MaxReasonableRTTMs comes free with no new dep.
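
The store-side guard could look roughly like this; transport.MaxReasonableRTTMs is real, while the Stats shape and function name are assumptions made for the sketch.

```go
package store

// Stats mirrors the min/max/avg latency snapshot pushed by visors.
type Stats struct {
	Min, Max, Avg float64 // milliseconds
}

// maxReasonableRTTMs would be transport.MaxReasonableRTTMs in TPD itself.
const maxReasonableRTTMs = 30_000.0

// acceptLatency applies the same defense-in-depth as the visor side: drop the
// whole snapshot if any field is non-positive or over the bound, so old
// visors pushing bogus values can no longer pin the stored max.
func acceptLatency(s Stats) bool {
	for _, v := range []float64{s.Min, s.Max, s.Avg} {
		if v <= 0 || v > maxReasonableRTTMs {
			return false
		}
	}
	return true
}
```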
